Abstract

This paper carefully examines the current status of statistical pattern
recognition by topic: classification rules, feature extraction, contextual
analysis, etc. Important but unsolved problem areas are also explored. The
relationship between statistical pattern recognition and signal processing
is also considered.

A Review of Statistical Pattern Recognition
C. H. Chen

I.  Introduction 

After more than twenty years of progress, the theory and applications of
statistical pattern recognition are now well developed. A number of textbooks
[1-11] are available. The limitations of statistical pattern recognition are
also evident: patterns are not characterized by statistical information alone,
and many useful statistical properties cannot be fully exploited with available
mathematical statistics. As in many other fields, there is a wide gap between
theory and practice. The limitation of finite sample size is mainly responsible
for this gap. The finite sample size effect is one of ten problem areas [12] in
statistical pattern recognition for which solutions are much needed.

In this paper the current status of statistical pattern recognition is
reviewed by topic, including classification rules, feature extraction,
contextual analysis, supervised and unsupervised learning and clustering,
finite sample size effects, and computational recognition complexity. Other
important but unsolved problem areas are examined. The relationship between
statistical pattern recognition and signal processing is also considered.

II.  The  Classification  Rules 

Statistical pattern recognition makes use of the decision-theoretic approach
to pattern recognition. The fundamental assumption is that patterns are random
in nature and thus can be described statistically in parametric or
nonparametric forms. The recognition problem essentially consists of
preprocessing, feature extraction and selection, and classification (decision
making), along with a training or learning process. Good classification is
almost always the main objective of a recognition system. The two best-known
statistical classification rules are the Bayes decision rule and the
nearest-neighbor decision rule.

Let x be a vector measurement of a pattern sample, and m be the number of
classes. The Bayes decision rule minimizes the average risk with respect to the
given a priori probabilities $P_i$, i = 1, 2, ..., m. For equal loss functions,
the Bayes decision rule reduces to the maximum likelihood decision rule (MLDR),
which chooses the class that maximizes the function

    $p(x/\omega_i)$;  i = 1, 2, ..., m

where the conditional probability densities $p(x/\omega_i)$ must be known or
estimated.

The optimal property of the Bayes decision rule is not always realized in
practice because the required a priori knowledge is either unavailable or
inaccurate. For two multivariate Gaussian densities with mean $\mu_i$ and
covariance $\Sigma_i$, i = 1, 2, the MLDR assigns x to the class for which

    $$(x - \mu_i)' \Sigma_i^{-1} (x - \mu_i) - 2 \ln\left(P_i / |\Sigma_i|^{1/2}\right) \tag{1}$$

is the minimum. It is not unusual to find in practice [13] that a modified
MLDR, which chooses the minimum of the form

    $$(x - \mu_i)' \Sigma_i^{-1} (x - \mu_i) \tag{2}$$

can perform better than the MLDR. This is an example of the gap between theory
and practice. The performance of the Bayes decision rule, that is, the Bayes
error probability, in general cannot be expressed in closed form. The error
estimate, which depends critically on the sample size, is by itself a
fundamental problem in statistics (see e.g. [14]).
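
To make the difference between the rules of Eqs. (1) and (2) concrete, a
minimal Python sketch for two-dimensional Gaussian classes is given here. It
assumes that the second term of Eq. (1) is 2 ln(P_i / |Sigma_i|^(1/2)); the
function names, the restriction to 2 x 2 covariance matrices, and the toy class
parameters are illustrative assumptions and are not taken from [13].

```python
import math

def quadratic_form(x, mean, cov):
    """Eq. (2): (x - mean)' cov^{-1} (x - mean) for a 2 x 2 covariance matrix."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    dx, dy = x[0] - mean[0], x[1] - mean[1]
    # The inverse of [[a, b], [c, d]] is [[d, -b], [-c, a]] / det.
    return (d * dx * dx - (b + c) * dx * dy + a * dy * dy) / det

def mldr_score(x, mean, cov, prior):
    """Eq. (1), assumed form: quadratic form minus 2 ln(P_i / |Sigma_i|^(1/2))."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    return quadratic_form(x, mean, cov) - 2.0 * math.log(prior / math.sqrt(det))

def classify(x, classes, modified=False):
    """classes: list of (label, mean, cov, prior); returns the minimizing label."""
    def score(entry):
        _, mean, cov, prior = entry
        if modified:
            return quadratic_form(x, mean, cov)
        return mldr_score(x, mean, cov, prior)
    return min(classes, key=score)[0]

# Hypothetical two-class problem with unequal covariance matrices.
CLASSES = [
    ("A", (0.0, 0.0), ((1.0, 0.0), (0.0, 1.0)), 0.5),
    ("B", (4.0, 0.0), ((4.0, 0.0), (0.0, 4.0)), 0.5),
]
```

For the point (1.9, 0) the two rules disagree in this toy setting:
`classify((1.9, 0.0), CLASSES)` selects class A, whereas
`classify((1.9, 0.0), CLASSES, modified=True)` selects class B, because the
modified rule ignores the determinant and prior term.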

The nearest-neighbor decision rule (NNDR) identifies the vector sample x with
the class of its nearest neighbor, nearness being measured by the Euclidean
distance. For the k-NNDR, the decision is based on the majority vote of the k
nearest neighbors. The advantage of the NNDR is that its asymptotic error rate
is bounded above by twice the Bayes error. The NNDR is nonparametric because
information on the probability densities is not needed. An obvious drawback of
the NNDR is that an extensive amount of distance computation is required.
Procedures to reduce the computation include the condensed NNDR, the edited
NNDR, selection of training samples, and the use of branch-and-bound
algorithms. Other modifications of the NNDR include the distance-weighted NNDR,
which can provide better recognition results in practice than the unweighted
NNDR discussed above. Replacement of the Euclidean distance by the quadratic
form given by Eq. (2) has also demonstrated superior recognition performance in
practice [15]. The performance of the NNDR at small sample size is not clear,
as the limited theoretical results available are inconclusive. For moderate to
large sample sizes, the NNDR performance is comparable to that of the MLDR.
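
The k-nearest-neighbor vote and its distance-weighted variant can be sketched
in a few lines of Python. This is only an illustration: the function name, the
inverse-distance weight, and the toy samples are assumptions, and other
weighting functions are possible.

```python
import math
from collections import defaultdict

def knn_classify(x, samples, k=1, weighted=False):
    """samples: list of (vector, label). Vote among the k nearest neighbors."""
    neighbors = sorted(samples, key=lambda s: math.dist(x, s[0]))[:k]
    votes = defaultdict(float)
    for vector, label in neighbors:
        distance = math.dist(x, vector)
        # Assumed weighting: inverse distance; the unweighted rule counts 1 per vote.
        votes[label] += 1.0 / (distance + 1e-12) if weighted else 1.0
    # Ties go to the label that entered the vote first, i.e. the nearer neighbor.
    return max(votes, key=votes.get)

# Hypothetical one-dimensional layout embedded in the plane.
SAMPLES = [
    ((0.0, 0.0), "A"), ((0.1, 0.0), "A"),
    ((1.0, 0.0), "B"), ((1.2, 0.0), "B"), ((1.3, 0.0), "B"),
]
```

With the query point (0.4, 0), the single nearest neighbor belongs to class A,
the unweighted vote with k = 5 selects class B, and the distance-weighted vote
with k = 5 returns to class A because the two A samples are much closer.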

The reject option has been considered for both the Bayes decision rule and the
NNDR. Errors can be reduced at the expense of some rejects. The error-reject
trade-off is an additional consideration in the reject option (see [16] for
recent results).
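
The error-reject trade-off can be illustrated with a simple threshold on the
largest a posteriori probability. The rule, the function names, and the toy
posteriors in this Python sketch are assumptions chosen for illustration; they
are not the specific procedure of [16].

```python
def decide_with_reject(posteriors, threshold):
    """posteriors: dict label -> a posteriori probability. None means reject."""
    label = max(posteriors, key=posteriors.get)
    return label if posteriors[label] >= threshold else None

def error_and_reject_rates(cases, threshold):
    """cases: list of (posteriors, true_label). Returns (error rate, reject rate)."""
    errors = rejects = 0
    for posteriors, truth in cases:
        decision = decide_with_reject(posteriors, threshold)
        if decision is None:
            rejects += 1
        elif decision != truth:
            errors += 1
    return errors / len(cases), rejects / len(cases)

# Hypothetical posteriors for four samples with known labels.
CASES = [
    ({"A": 0.9, "B": 0.1}, "A"),
    ({"A": 0.6, "B": 0.4}, "B"),
    ({"A": 0.55, "B": 0.45}, "A"),
    ({"A": 0.2, "B": 0.8}, "B"),
]
```

On these four samples a threshold of 0.5 rejects nothing and misclassifies one
sample, while a threshold of 0.7 rejects the two uncertain samples and makes no
error, which is the trade-off described above.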

Linear, piecewise linear, and quadratic discriminant functions have been
extensively investigated, especially in the statistical literature. However,
closed-form error probability expressions are generally unavailable except in
the simple case of multivariate Gaussian densities with unequal means and equal
covariance matrices. The use of the MLDR is implied for parametric discriminant
analysis, and the optimization criterion is the minimum error probability.
Fisher's linear discriminant is a nonparametric technique that maximizes the
ratio of between-class scatter to within-class scatter in the one-dimensional
space onto which the vector measurements are projected. This projection is a
many-to-one mapping and in theory cannot possibly reduce the minimum attainable
error probability.
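
For two classes, the Fisher direction is proportional to the inverse of the
pooled within-class scatter matrix applied to the difference of the class
means. A minimal Python sketch for two-dimensional measurements follows; the
function names and the sample points are illustrative assumptions.

```python
def mean(vectors):
    """Sample mean of a list of 2-D vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(2)]

def scatter(vectors, m):
    """2 x 2 scatter matrix: sum of (v - m)(v - m)' over the class samples."""
    s = [[0.0, 0.0], [0.0, 0.0]]
    for v in vectors:
        d = [v[0] - m[0], v[1] - m[1]]
        for i in range(2):
            for j in range(2):
                s[i][j] += d[i] * d[j]
    return s

def fisher_direction(class1, class2):
    """w = Sw^{-1} (m1 - m2), with Sw the pooled within-class scatter."""
    m1, m2 = mean(class1), mean(class2)
    s1, s2 = scatter(class1, m1), scatter(class2, m2)
    a, b = s1[0][0] + s2[0][0], s1[0][1] + s2[0][1]
    c, d = s1[1][0] + s2[1][0], s1[1][1] + s2[1][1]
    det = a * d - b * c
    dm = [m1[0] - m2[0], m1[1] - m2[1]]
    # Apply the explicit 2 x 2 inverse [[d, -b], [-c, a]] / det to dm.
    return [(d * dm[0] - b * dm[1]) / det, (-c * dm[0] + a * dm[1]) / det]

def project(w, v):
    """Many-to-one mapping of a 2-D measurement onto the Fisher axis."""
    return w[0] * v[0] + w[1] * v[1]

# Hypothetical samples: two unit squares separated along the first axis.
CLASS1 = [(0, 0), (1, 0), (0, 1), (1, 1)]
CLASS2 = [(3, 0), (4, 0), (3, 1), (4, 1)]
```

For these samples the direction is [-1.5, 0.0], so the projection keeps only
the first coordinate, along which the two classes are fully separated.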

For complex patterns such as images, a multi-stage decision-tree classifier
has been shown experimentally to have better overall performance than the
conventional single-stage classifier [17][18]. However, the classification time
increases due to the complexity of computation. A linear binary tree classifier
can be used [19] to take advantage of the accuracy of a decision-tree
classifier and to use the linear discriminant function at decision stages to
reduce the classification time. With a pre-designed tree structure, the overall
computation time can be less than ten percent of that of a single-stage
classifier. Although a different feature subset may be used at each decision
stage, the search for an optimum feature subset requires additional
computation. The problem of optimizing the decision tree structure has been
considered (see e.g. [20]). The methods considered for reducing the
computational complexity include clustering the decision rules and the use of
branch-and-bound procedures to find efficient decision rules and for feature
assignment. The decision-tree classifier is the most promising classification
mechanism for the increasingly complex recognition problems of the future.
Features can be mathematical, structural, or various combinations of the two.
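
The idea of a linear binary tree classifier, with a linear discriminant and
its own feature subset at each decision stage, can be sketched as follows. The
class name, field names, and the small pre-designed tree are hypothetical and
serve only as an illustration; they do not reproduce the design of [19].

```python
class Node:
    """A leaf carries a class label; an internal node carries a linear test."""

    def __init__(self, label=None, features=(), weights=(), bias=0.0,
                 negative=None, positive=None):
        self.label = label
        self.features = features   # indices of the feature subset used here
        self.weights = weights     # one weight per selected feature
        self.bias = bias
        self.negative = negative   # subtree taken when the test value is < 0
        self.positive = positive   # subtree taken when the test value is >= 0

def classify(node, x):
    """Follow the linear tests from the root until a leaf is reached."""
    while node.label is None:
        value = sum(w * x[i] for w, i in zip(node.weights, node.features))
        value += node.bias
        node = node.positive if value >= 0 else node.negative
    return node.label

# Hypothetical pre-designed tree for three classes and two features.
TREE = Node(
    features=(0,), weights=(1.0,), bias=-1.0,
    positive=Node(label="C"),
    negative=Node(features=(1,), weights=(1.0,), bias=-1.0,
                  positive=Node(label="B"), negative=Node(label="A")),
)
```

Each decision stage evaluates only the features it needs, so a sample that is
resolved at the root never pays for the second feature.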

Although the sequential decision procedure is, theoretically speaking, suitable
mainly for independent, identically distributed measurements, the flexibility
allowed by feature ordering, or even on-line feature ordering, is its most
attractive capability.

The table look-up decision rule stores the decision rule itself rather than
the densities. The vector measurement x is used as an address into a table,
which returns the class assignment for x. The table, which is stored in memory,
assigns a class to each (quantized) vector in the measurement space. Procedures
to reduce the memory requirements and to speed up the decision assignment have
been considered [21][22].
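
A table look-up rule can be sketched by quantizing each measurement to a cell
and storing one class label per cell. In the Python sketch below, the uniform
quantization step, the majority-label rule used to fill the table, and all
names are assumptions made for illustration.

```python
def quantize(x, step):
    """Map a measurement vector to the integer index of its quantization cell."""
    return tuple(int(v // step) for v in x)

def build_table(samples, step):
    """samples: list of (vector, label). Store the majority label of each cell."""
    counts = {}
    for vector, label in samples:
        cell_counts = counts.setdefault(quantize(vector, step), {})
        cell_counts[label] = cell_counts.get(label, 0) + 1
    return {cell: max(c, key=c.get) for cell, c in counts.items()}

def lookup(table, x, step, default=None):
    """Classify x by using its quantized value as the table address."""
    return table.get(quantize(x, step), default)

# Hypothetical training samples in the plane.
SAMPLES = [
    ((0.1, 0.2), "A"), ((0.3, 0.4), "A"), ((0.4, 0.1), "B"),
    ((1.2, 0.3), "B"),
]
TABLE = build_table(SAMPLES, step=1.0)
```

Classification then costs one quantization and one memory access, for example
`lookup(TABLE, (0.9, 0.9), 1.0)`, at the price of a table whose size grows with
the number of occupied cells.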

Another generalization of the conventional decision theory framework is the
simultaneous membership of a measurement in several classes, which originates
from the "degree of membership" in fuzzy set theory. The compound decision
rules and the finite sample size effects in sample-based classification rules
will be discussed in later sections.

III. Feature Extraction

Mathematical features as well as structural features are best suited for
automatic recognition, although they may not necessarily have physical meaning
and may be quite different from features derived by the human recognition
process. A fundamental approach to feature extraction in statistical pattern
recognition is to evaluate a number of available features in order to select a
small subset of good features. Such evaluation can be based on a direct
estimate of the error probability. Many feature selection criteria have been
proposed for feature evaluation, including various distance and information
measures (see e.g. [23][24]). These measures are very effective even though
they do not always choose the feature set that has the smallest error. The
relative effectiveness of various measures has been considered [25]. These
measures are also very useful for error estimates [26].

Another useful approach is the linear transformation methods. If a pattern can
be completely described by second-order statistics, the Karhunen-Loeve
transform is optimal in the mean-square error sense. However, second-order
statistics are not adequate for most patterns, and the transform also requires
excessive computation. It is a misconception that feature extraction is nothing
more than dimensionality reduction and that the Karhunen-Loeve transform solves
all mathematical feature extraction problems.
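
As an illustration of the Karhunen-Loeve transform in its simplest setting, the
Python sketch below finds the leading eigenvector of a two-dimensional sample
covariance matrix in closed form and projects the centered data onto it. The
function names, the 1/n covariance normalization, and the test points are
assumptions of this sketch.

```python
import math

def kl_axis(points):
    """Return (mean, leading eigenvector, eigenvalues) of the 2-D covariance."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    a = sum((p[0] - mx) ** 2 for p in points) / n
    c = sum((p[1] - my) ** 2 for p in points) / n
    b = sum((p[0] - mx) * (p[1] - my) for p in points) / n
    # Eigenvalues of the symmetric matrix [[a, b], [b, c]].
    half_trace = 0.5 * (a + c)
    radius = math.sqrt((0.5 * (a - c)) ** 2 + b * b)
    lam1, lam2 = half_trace + radius, half_trace - radius
    if b != 0.0:
        v = (b, lam1 - a)          # solves (a - lam1) v0 + b v1 = 0
    else:
        v = (1.0, 0.0) if a >= c else (0.0, 1.0)
    norm = math.hypot(v[0], v[1])
    return (mx, my), (v[0] / norm, v[1] / norm), (lam1, lam2)

def project(points, mean, axis):
    """One-dimensional KL coefficients of the centered points."""
    return [(p[0] - mean[0]) * axis[0] + (p[1] - mean[1]) * axis[1]
            for p in points]

# Hypothetical points lying exactly on the line y = x.
POINTS = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0), (3.0, 3.0)]
```

The discarded eigenvalue equals the mean-square error of the one-dimensional
representation; for points on a line it is zero. Nothing in this criterion
refers to class labels, which is consistent with the caution above that
dimensionality reduction alone does not solve feature extraction.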

A realistic solution to feature extraction must take into consideration the
nature of the patterns, the a priori knowledge available, and the specific
requirements and constraints of the given recognition task. Although exhaustive
search is about the only way to find the best feature set, efficient
feature-set search procedures are much needed [27] to provide a computationally
feasible solution.
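
The contrast between exhaustive search and a cheaper, suboptimal search can be
sketched in Python. Sequential forward selection is used here as one familiar
example of an efficient procedure; it is not claimed to be the method of [27],
and the criterion values in the table are invented for illustration.

```python
from itertools import combinations

def exhaustive_search(n_features, size, criterion):
    """Evaluate every subset of the given size and return the best one."""
    return max(combinations(range(n_features), size), key=criterion)

def forward_selection(n_features, size, criterion):
    """Greedy search: add, one at a time, the feature that helps most."""
    selected = []
    while len(selected) < size:
        remaining = [f for f in range(n_features) if f not in selected]
        best = max(remaining,
                   key=lambda f: criterion(tuple(sorted(selected + [f]))))
        selected.append(best)
    return tuple(sorted(selected))

# Hypothetical criterion values (larger is better) for three features.
SCORES = {
    (0,): 3.0, (1,): 2.0, (2,): 1.0,
    (0, 1): 4.0, (0, 2): 4.0, (1, 2): 5.0,
}

def criterion(subset):
    return SCORES[subset]
```

In this toy table the best pair is (1, 2), which exhaustive search finds, while
forward selection returns (0, 1) because it commits to the best single feature
first. The saving appears at realistic sizes: choosing 10 of 20 features
requires 184756 subset evaluations exhaustively but only 155 with forward
selection.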

Feature extraction and selection is important not only for pattern recognition
but also for signal processing and communications. A properly selected feature
subset can represent a compression of the original signal, so that transmission
requirements such as bandwidth can be greatly reduced. However, feature
selection differs from signal selection in communications in one important
respect: the additive white noise assumption usually does not apply to the
pattern recognition problem. Extracting the right features that truly
characterize a pattern is a real challenge to human intelligence. Although much
has been studied, feature extraction will remain a key problem in pattern
recognition.

IV. Contextual Analysis

A major weakness of statistical pattern recognition is the difficulty of taking
contextual relations into account in the recognition process. Compound decision
theory appears to be the closest statistical theory that can take contextual
information into account. When a statistical decision problem is repeated n
times, with no relationships among the individual problems, the compound
decision rule makes use of the information from all measurements from the n
repetitions to make decisions on individual problems. In character recognition
of a text, for example, decisions have to be made on individual characters. The
contextual information, in terms of transition probabilities among characters,
can be utilized to improve the recognition of individual characters. Similarly,
in image recognition, individual picture elements or subimages may have to be
classified. The information on the correlation among picture elements or
subimages should be used for better classification. Although very few
theoretical results are available to measure the amount of performance
improvement due to the use of contextual information, experimental results have
all demonstrated the available improvement. To implement the compound decision
rule, the Markov chain, the model of a stationary stochastic process for the
pattern, and the coding of spatial correlation parameters [28] are among the
useful tools.
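
One way in which transition probabilities among characters could be combined
with the individual character likelihoods is a first-order Markov chain
searched with the Viterbi algorithm. The Python sketch below is an illustrative
assumption and not necessarily the procedure used in the cited work; the
function name, the three-character alphabet, and all probabilities are
invented.

```python
import math

def viterbi(likelihoods, transition, initial):
    """Most probable class sequence under a first-order Markov chain.

    likelihoods: one dict per position, class -> p(x_t / class)
    transition:  dict (previous class, current class) -> probability
    initial:     dict class -> probability of the first class
    All probabilities are assumed to be strictly positive.
    """
    classes = list(initial)
    score = {c: math.log(initial[c]) + math.log(likelihoods[0][c])
             for c in classes}
    back = []
    for obs in likelihoods[1:]:
        new_score, pointers = {}, {}
        for c in classes:
            prev = max(classes,
                       key=lambda p: score[p] + math.log(transition[(p, c)]))
            new_score[c] = (score[prev] + math.log(transition[(prev, c)])
                            + math.log(obs[c]))
            pointers[c] = prev
        back.append(pointers)
        score = new_score
    path = [max(classes, key=score.get)]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]

# Hypothetical two-character word; the second character is ambiguous.
LIKELIHOODS = [
    {"q": 0.8, "u": 0.1, "o": 0.1},
    {"q": 0.05, "u": 0.45, "o": 0.5},
]
INITIAL = {"q": 1 / 3, "u": 1 / 3, "o": 1 / 3}
TRANSITION = {
    ("q", "q"): 0.05, ("q", "u"): 0.9, ("q", "o"): 0.05,
    ("u", "q"): 0.2, ("u", "u"): 0.4, ("u", "o"): 0.4,
    ("o", "q"): 0.2, ("o", "u"): 0.4, ("o", "o"): 0.4,
}
```

Deciding each character separately would read the second character as "o",
since its likelihood is slightly larger. With the transition probabilities
included, the sequence "q", "u" is chosen instead.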


Consider the recognition of each subimage of an image. By assuming dependence
only on the four adjacent subimages, the compound decision rule is to choose
the class $\omega_i$ which maximizes

    $$\prod_{j=1}^{k} p(x_j/\omega_i) \tag{3}$$

where i = 1, 2, ..., m and the $x_j$ include the vector measurement of the
subimage under consideration. If we assume dependence on all eight neighboring
subimages, then the expression inside the product sign should have the
conditional probability densities of all eight neighbors. Experimental results
have demonstrated [29] that there is very little performance difference between
four and eight neighbors.

While much remains to be done in image recognition using contextual information
to classify a whole image or individual subimages (or picture elements), there
has been very significant progress in the character recognition area (see e.g.
[30][31]).


V. Supervised and Unsupervised Learning and Clustering

Learning is needed in pattern recognition to establish, from samples, the
required statistical knowledge, such as the statistical parameters, probability
densities, or even the decision boundaries. When the samples are of known
classification, learning is supervised; otherwise it is unsupervised. In terms
of the statistical framework, supervised learning follows exactly the classical
Bayesian and maximum likelihood estimation theories. Mixture estimation and
decomposition in statistics is one approach to unsupervised learning. Many
details on the learning algorithms, as well as on decision-directed learning,
are available in pattern recognition texts [1-11]. It is important to note that
the criterion of minimizing the mean-square error between the estimated and
true parameters is used almost exclusively in learning and estimation. While
the objective of classification is the minimum error probability, there is no
guarantee that the learning algorithms will result in minimum classification
error. Some effort has been made to design learning algorithms using window
functions to minimize the classification error directly [32]. However, the
convergence rate may be slow. In addition to properly selecting the window
parameter, other procedures should be examined to speed up the convergence. A
good understanding of the relationship [33] between estimation and decision is
necessary. More flexible structures for the learning process should be
considered. For example, the initial learning phase may use the conventional
minimum mean-square error criterion, and the subsequent learning phase can be
based on the minimum error probability criterion. Another example is that a
supervised learning process can be switched to unsupervised learning or vice
versa. Of course, the optimum usage of each learning phase would be a new
problem to be examined [34].
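
Mixture estimation as a form of unsupervised learning can be illustrated with
a two-component, one-dimensional Gaussian mixture fitted by the
expectation-maximization (EM) iteration. EM is chosen here only as a familiar
example; the function names, the initialization, the variance floor, and the
data are assumptions of this Python sketch.

```python
import math

def normal_pdf(x, mu, var):
    """Density of a one-dimensional Gaussian with mean mu and variance var."""
    return math.exp(-(x - mu) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def em_two_gaussians(data, iterations=50):
    """Estimate weights, means, and variances from unlabeled samples."""
    mu = [min(data), max(data)]   # assumed initialization at the extremes
    var = [1.0, 1.0]
    weight = [0.5, 0.5]
    for _ in range(iterations):
        # E-step: probability that each sample belongs to component 0.
        resp0 = []
        for x in data:
            p0 = weight[0] * normal_pdf(x, mu[0], var[0])
            p1 = weight[1] * normal_pdf(x, mu[1], var[1])
            resp0.append(p0 / (p0 + p1))
        # M-step: re-estimate each component from its weighted samples.
        for k, resp in enumerate((resp0, [1.0 - r for r in resp0])):
            total = sum(resp)
            weight[k] = total / len(data)
            mu[k] = sum(r * x for r, x in zip(resp, data)) / total
            spread = sum(r * (x - mu[k]) ** 2 for r, x in zip(resp, data)) / total
            var[k] = max(spread, 1e-6)   # floor keeps the density well defined
    return weight, mu, var

# Hypothetical unlabeled samples drawn around 0 and around 5.
DATA = [0.0, 0.2, -0.1, 0.1, 5.0, 5.2, 4.9, 5.1]
```

For these well-separated samples the estimated means settle near 0.05 and 5.05
with equal weights, without any class labels being supplied.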

Clustering is an important subject by itself in statistical data analysis,
although it may be considered as unsupervised learning in pattern recognition.
Clustering can be defined as a partition of the set of vector measurements such
that each measurement is assigned to one and only one set among a collection of
disjoint sets. A recent discussion of the subject is in [35], in addition to
the texts [1-11]. The problem of clustering individuals can be considered
within the context of a mixture of distributions [36]. The cluster validity
problem is discussed in [37].
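
The definition of clustering as a partition into disjoint sets can be made
concrete with the k-means procedure, used here as one common example rather
than as the method of any cited reference. The function name, the fixed number
of iterations, and the sample points in this Python sketch are assumptions.

```python
import math

def kmeans(points, centers, iterations=20):
    """Assign each point to its nearest center, then recompute the centers."""
    centers = [tuple(c) for c in centers]
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)),
                          key=lambda i: math.dist(p, centers[i]))
            clusters[nearest].append(p)   # each point joins exactly one set
        centers = [
            tuple(sum(coord) / len(members) for coord in zip(*members))
            if members else centers[i]    # an empty cluster keeps its center
            for i, members in enumerate(clusters)
        ]
    return clusters, centers

# Hypothetical measurements forming two groups in the plane.
POINTS = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
CLUSTERS, CENTERS = kmeans(POINTS, centers=[(0, 0), (10, 10)])
```

The returned clusters are disjoint and together contain every measurement
exactly once, which is the partition property stated above. The number of
clusters and the initial centers must be supplied, which is where the cluster
validity question arises.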

VI.  Finite  Sample  Size  Effects 

In practical recognition problems the sample size is limited. The actual
recognition performance may be quite different from that theoretically
predicted on the basis of infinite sample size. Indeed, the finite sample size
and its associated dimensionality problem are fundamental to all pattern
recognition problems. For example, the decision rules in practice are
sample-based. Expected errors of the sample-based classification rules
generally do not have closed-form solutions at small sample size. Distance and
information measures evaluated under finite sample size may be highly
inaccurate. A general discussion of the finite learning sample size problem is
in [38][39][40], among others.

The best way to reduce the finite sample size effect is to increase the sample
size with respect to the dimensionality. For images, the dimensionality
includes the number of picture elements and the number of quantization levels.
The relationships among performance, sample size, and dimensionality are highly
nonlinear. In general, when the sample size is moderately large to large, the
effects of finite sample size are not very significant. A thorough study of the
subject is much needed, as it will certainly be helpful in designing a reliable
recognition system for a given set of features.
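
The finite sample size effect can be illustrated by a small simulation. The
Python sketch below assumes two one-dimensional Gaussian classes with unit
variance, means 0 and 2, and equal a priori probabilities; a nearest-mean rule
is designed from n samples per class, and its exact error probability is then
averaged over repeated designs. The setting and all names are assumptions made
for illustration.

```python
import math
import random

def phi(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def true_error(m0, m1, sep=2.0):
    """Exact error of the nearest-mean rule built from estimated means m0, m1."""
    t = 0.5 * (m0 + m1)                      # decision threshold
    err = 0.5 * (1.0 - phi(t)) + 0.5 * phi(t - sep)
    # If the estimated means are in the wrong order, the decisions are inverted.
    return err if m0 < m1 else 1.0 - err

def expected_error(n, trials=200, sep=2.0, seed=0):
    """Average true error over repeated designs with n samples per class."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        m0 = sum(rng.gauss(0.0, 1.0) for _ in range(n)) / n
        m1 = sum(rng.gauss(sep, 1.0) for _ in range(n)) / n
        total += true_error(m0, m1, sep)
    return total / trials

BAYES_ERROR = phi(-1.0)   # threshold exactly halfway between the true means
```

Comparing `expected_error(2)` with `expected_error(500)` shows the pattern
described above: with two samples per class the average error lies clearly
above the Bayes error, while with several hundred samples the difference is
negligible.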

VII. Computational Recognition Complexity

The term "computation complexity" has different meanings in different
situations and is not well defined for pattern recognition researchers. The
Kolmogorov information-theoretic computational complexity is defined as the
minimum length of the program needed to obtain an object from data. While in
linear discrimination the complexity of the classifier is usually identified
with the dimensionality of the vector measurement, the discriminating
capability of Boolean classifiers is determined not only by the dimensionality
of the feature vectors but also by the type of combinations these features are
permitted to undergo. In this case we talk about the combinational complexity
of the decision rule. Intuitively, the complexity concept can give us a feeling
of what is complex and what is less complex, so complexity should be a
relative, not an absolute, measure. A more familiar definition of complexity to
engineers is the amount of computational effort, including time and cost,
needed to accomplish a recognition task. To be machine independent, the
complexity will include mainly the number of manipulations such as
multiplication and comparison operations. The recognition complexity based on
this definition can be reduced by proper implementation techniques such as the
use of sequential-parallel operations.
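
A machine-independent count of multiplications and comparisons can be written
down directly for the classification rules discussed earlier. The counts in
the Python sketch below assume straightforward implementations with no
exploitation of symmetry or shared computation, and the function names are
hypothetical.

```python
def linear_ops(d, m):
    """m linear discriminants in d dimensions: d multiplications each."""
    return {"multiplications": m * d, "comparisons": m - 1}

def quadratic_ops(d, m):
    """m quadratic forms: d*d for the matrix-vector product, d for the dot."""
    return {"multiplications": m * (d * d + d), "comparisons": m - 1}

def nearest_neighbor_ops(d, n):
    """Squared Euclidean distance to each of n stored samples."""
    return {"multiplications": n * d, "comparisons": n - 1}
```

For d = 10 features and m = 3 classes these naive counts give 30
multiplications for the linear rule and 330 for the quadratic rule, while a
nearest-neighbor search over 1000 stored samples needs 10000 multiplications
and 999 comparisons.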




For the overall recognition complexity of a recognition system, the trade-off
between feature extraction and classification must be considered. A complicated
feature extraction process results in a few good features, and the resulting
classifier can be a very simple one. If no feature extraction effort is made,
so that a large number of features are used, the required classification and
learning process will be very complicated. The problem of determining an
optimum overall recognition time has not been considered. The solution to this
problem should be particularly useful for real-time pattern recognition.

VIII.  Other  Problem  Areas 

In addition to the topics considered above, there are a number of other
problem areas where solutions are partially available or completely
unavailable.

1. Learning and classification of nonstationary patterns. Only special cases
have been examined.

2. A truly optimal recognition system that jointly optimizes the preprocessing,
feature extraction, and classification and learning. No solution is available.

3. Statistical and syntactic mixed model. Much has been said but little success
has been reported.

4. Automatic generation of recognition rules. No solution is available.

5. Interactive pattern recognition. Very significant progress has been made in
providing man-machine interaction in pattern recognition.

IX.  Relationships  with  Signal  Processing 

Many statistical pattern recognition techniques, such as feature extraction
and classification, can be considered as "nonlinear" signal processing. On the
other hand, many digital signal processing techniques are especially needed for
the preprocessing phase of the recognition process. However, in signal
processing the emphasis is on manipulation of patterns of a single class, while
in pattern recognition the emphasis is on the differences among patterns from
several classes. Integration of processing and recognition into one system has
been necessary in many applications.